Approximate string matching as an algebraic computation
Abstract
Approximate string matching has a long history and employs a wide variety of methods (see e.g. the survey [2]). We consider a variant of approximate matching that compares a fixed pattern string to every substring of the text string by a rational-weighted edit distance (e.g. the indel distance, defined as the number of character insertions and deletions, or the indelsub/Levenshtein distance, where character substitutions are also allowed). By a simple transformation of the pattern and the text, the problem can be reduced to computing the longest common subsequence between the pattern and every substring of the text. This generic form of approximate matching captures many different problems that have been considered in the past. For such problems, ad-hoc dynamic programming algorithms have typically been designed; a recent example, motivated by modern genome sequencing technologies, is given by [1]. We show that many of these specialised solutions can be unified, and often improved or generalised, by expressing the approximate matching problem in the language of abstract semigroup algebra. Our approach creates a powerful alternative to standard dynamic programming, allowing the computation to be performed independently and simultaneously on different parts of the pattern and the text. As a result, our method provides efficient solutions in situations where this extra independence can be exploited: in particular, approximate matching on compressed strings, parallel string comparison, and local comparison of genome sequences. This paper references our recent work [3].
Body
Matching a pattern approximately to every substring in a text = computing in the classical braid group, where crossings are made idempotent.
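The reduction to longest common subsequence (LCS) mentioned in the abstract can be illustrated for the indel distance, where the distance between strings a and b equals |a| + |b| - 2 LCS(a, b). The following is a minimal sketch of the naive dynamic-programming baseline, not the semigroup-based method the paper proposes. The function names `lcs_last_row` and `substring_indel_distances` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Tuple


def lcs_last_row(a: str, b: str) -> List[int]:
    """Return a list r with r[j] = LCS length of a and the prefix b[:j]."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j in range(1, len(b) + 1):
            if ch == b[j - 1]:
                # Matching characters extend the diagonal LCS by one.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from a or from b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev


def substring_indel_distances(pattern: str, text: str) -> Dict[Tuple[int, int], int]:
    """Map every (i, j), 0 <= i <= j <= len(text), to the indel distance
    between pattern and text[i:j], using |a| + |b| - 2 * LCS(a, b)."""
    m = len(pattern)
    distances = {}
    for i in range(len(text) + 1):
        # One DP pass per start position gives the LCS against every
        # substring that starts at i.
        row = lcs_last_row(pattern, text[i:])
        for k, lcs in enumerate(row):
            distances[(i, i + k)] = m + k - 2 * lcs
    return distances


if __name__ == "__main__":
    d = substring_indel_distances("ab", "acb")
    print(d[(0, 3)])  # distance between "ab" and "acb"
```

This baseline recomputes a full dynamic-programming table for each start position, which is the kind of sequential, position-by-position work that the abstract's algebraic formulation aims to replace with independent computations on separate parts of the pattern and text.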
Related resources
On Approximate String Matching of Unique Oligonucleotides
The current research considers the approximate string matching search for important subsequences from DNA sequences, which is essential for numerous bioinformatics computation tasks. We tested several approximate string matching algorithms and furthermore developed one for DNA data. Run times of the algorithms are important, since the amount of data is very large.
Restricted Transposition Invariant Approximate String Matching Under Edit Distance
Let A and B be strings with lengths m and n, respectively, over a finite integer alphabet. Two classic string matching problems are computing the edit distance between A and B, and searching for approximate occurrences of A inside B. We consider the classic Levenshtein distance, but the discussion is applicable also to indel distance. A relatively new variant [8] of string matching, motivated in...
O(k) Parallel Algorithms for Approximate String Matching
Given a text string T of length n, a shorter pattern string A of length m, and an integer k, a simple, straightforward O(k) parallel algorithm for finding all occurrences of the pattern string in the text string with at most k differences (as defined by edit distance) is presented. The algorithm uses the priority CRCW-PRAM model of computation and (n m+ k + 2) m = O(n m) processors. Over recent dec...
Fast Convolutions and Their Applications in Approximate String Matching
We develop a method for performing boolean convolutions efficiently in the word RAM model of computation, having a word size of w = Ω(log n) bits, where n is the input size. The technique is applied to approximate string matching under Hamming distance. The obtained algorithms are the fastest known. In particular, we reduce the complexity of the Amir et al. [1] algorithm for k-mismatches from O(n √...
Alignments and String Similarity in Information Integration: A Random Field Approach
Several problems central to information integration, such as ontology mapping and object matching, can be viewed as alignment tasks where the goal is to find an optimal correspondence between two structured objects and to compute the associated similarity score. The diversity of data sources and domains in the Semantic Web requires solutions to these problems to be highly adaptive, which can be...
Journal title: TinyToCS
Volume: 1
Publication date: 2012